GenieAI runs regular internal studies to understand what drives high-quality legal output, pushing the boundaries of Genie's own legal accuracy and benchmarking the platform's capabilities against other AI providers.
To make this data trustworthy, we designed the benchmark to be as controlled and repeatable as possible:
Same case, same evidence, same prompt: Every system receives the identical prompt and 65-document bundle, so differences in scores come from output quality rather than input advantages.
Broad, realistic test set: The source pack spans 65 simulated documents across multiple document types (contracts, board minutes, financial statements, regulatory filings, and more) to reflect the cross-referencing demands of real legal work.
Pre-defined scoring framework: Outputs are evaluated across 15 clearly defined legal-quality metrics, each scored 1–10, for a maximum total of 150. This reduces “moving goalposts” and keeps comparisons consistent across runs.
Evidence-led grading: Where a system makes claims, we check whether they are supported by the underlying documents (e.g. specific figures, dates, contract clauses, regulatory obligations). Higher scores require traceable support.
Separation of “analysis” vs “speculation”: The rubric rewards accurate synthesis and properly qualified uncertainty, and penalizes confident extrapolations that aren’t grounded in the documents.
Reproducible methodology: Because the scenario, document set, prompt, and rubric are fixed, the test can be, and is, rerun to verify that results remain stable over time.
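The aggregation behind the scores can be sketched in a few lines. This is an illustrative reconstruction of the stated 15 × 1–10 rubric, not the benchmark's actual grading code (which is not published); the validation rules and rounding convention are assumptions.

```python
# Minimal sketch of the scoring arithmetic described above, assuming a
# simple sum-and-percentage aggregation over the 15 x 1-10 rubric.
# Illustrative only; the benchmark's actual grading code is not published.

NUM_METRICS = 15
MAX_PER_METRIC = 10
MAX_TOTAL = NUM_METRICS * MAX_PER_METRIC  # 150

def score_summary(metric_scores):
    """Aggregate fifteen 1-10 metric scores into (total, percentage)."""
    if len(metric_scores) != NUM_METRICS:
        raise ValueError(f"expected {NUM_METRICS} scores")
    if not all(1 <= s <= MAX_PER_METRIC for s in metric_scores):
        raise ValueError("each metric is scored 1-10")
    total = sum(metric_scores)
    return total, round(100 * total / MAX_TOTAL, 1)
```

For example, a response scoring 9 on every metric would total 135, i.e. 90.0% of 150, the same total GenieAI reports below.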
Below is the latest benchmark data from this methodology, based on analysis of 65 simulated documents across a broad variety of document types.
GenieAI
Legal Quality Benchmark · Three-Way
GenieAI vs CoWork vs ChatGPT
A 15-metric evaluation of AI-generated legal risk assessments across 65 source documents in a simulated Tesla European expansion case.
Comprehensive risk assessment covering partnership exposures, regulatory challenges, and strategic objectives with specific financial figures
Prompt
I need to prepare a comprehensive risk assessment document for Tesla's European expansion strategy. Cover: (1) key partnership risks with specific financial exposures and commitments, (2) regulatory challenges with potential revenue impact figures, and (3) strategic objectives from board discussions including production targets. Include specific figures and metrics where available.
Expected Key Points
Board authorized 3 strategic partnerships for European expansion
NexGen: solid-state battery supply, EUR 2.5B+ annual commitment by 2028
AutonomX: autonomous driving for EU market, EUR 250M+ total investment
Board considering QuantumFlux acquisition to reduce NexGen dependency
Type Approval issues could impact EUR 189M-567M in revenue
Strategic objective: 20M vehicles annually by 2030 (Master Plan Part 3)
Overall Scores
15 legal quality metrics, each scored 1-10, max 150
GenieAI
135 / 150 (90.0%)
A+
First response across all benchmark runs to reach A+. Seven perfect 10/10 scores. The most comprehensive risk assessment with depth AND breadth.
Best for: Board-grade risk assessment, litigation prep, cross-domain synthesis
CoWork
119 / 150 (79.3%)
B+
Competent legal risk assessment with the strongest clause-level analysis and most structured three-tier action plan.
Best for: Structured recommendations, clause-level contractual analysis
ChatGPT
56 / 150 (37.3%)
F
Misses QuantumFlux entirely, zero regulatory coverage, 2/8 key points. Presents speculative extrapolations on incorrect base figures as authoritative projections.
Best for: Financial scenario modeling only; insufficient for legal work product
+16
GenieAI vs CoWork
GenieAI leads in 11 of 15 metrics. Gap driven by RAG-based document mining: cross-reference synthesis, financial precision, evidence depth, and counterparty analysis.
+63
CoWork vs ChatGPT
The 63-point gap between CoWork and ChatGPT is almost four times the 16-point gap between GenieAI and CoWork. ChatGPT's regulatory coverage (1/10), key points (2/10), and dispute posture (2/10) are fundamentally insufficient.
01 · Factual Accuracy
GN
10
CW
8
GPT
6
GenieAI
EUR 11.9B precisely matching First Amendment. Cell pricing per product. NexGen concentration 10%→43.5%. Every figure cross-referenceable.
CoWork
EUR 11.6B total NexGen (minor variance vs 11.9B). Year-by-year schedule consistent. All 4 partnerships covered.
ChatGPT
Core figures correct (EUR 11.9B, EUR 50M advance). BUT introduces incorrect bases: EUR 45K ASP (actual EUR 28.5K-39.5K), EUR 4.7B Berlin and EUR 675M NordischEM derived from wrong ASP, EUR 525M FSD entirely hypothetical.
02 · Source Attribution
GN
9
CW
8
GPT
5
GenieAI
Contract refs for all 5 partnerships. First Amendment articles. EU Reg articles 7, 8, 48, 77. Board dates. TSL-REG-EU-2023-001.
ChatGPT
Document bundle references ("17-internal-documents-combined") - bundle-level, not clause-level. No contract reference numbers. No section citations. No EU regulation articles. No board dates.
03 · Legal Reasoning
GN
8
CW
8
GPT
4
GenieAI
70-80% FM probability with causal nexus reasoning. Tesla JDA failures acknowledged. Board authorization deviations as governance concern. 12× escalation pattern.
CoWork
FM exclusion under §13.3(c). Bergmann email as concealment evidence. Honest AutonomX assessment. NDA leverage. Non-delegable EU liability.
ChatGPT
FM analysis is binary: "if it fails / if it succeeds." No §13.3(c), no §1.11(e) reasonable care. Concealment noted but no legal implications drawn. No governance analysis. Reads as financial consulting.
04 · Risk Coverage
GN
10
CW
8
GPT
5
CoWork
10 risks across 4 partnerships. NE-EV1 competitive product. EU Battery Regulation. QuantumFlux tech readiness gap.
ChatGPT
NexGen covered extensively (FM, pricing, lithium corridor). AutonomX thin. NordischEM basic. QuantumFlux COMPLETELY MISSING. Type Approval COMPLETELY MISSING. Battery Regulation COMPLETELY MISSING. Master Plan Part 3 MISSING.
05 · Evidentiary Quality
GN
9
CW
7
GPT
5
GenieAI
Bergmann email. Schwartz early warning Feb 2023. Board votes (6-0 Musk abstaining; 7-0 First Amendment). EY going concern. Quarterly revenue EUR 98.5M. Patent numbers. Authorization deviation table.
CoWork
Bergmann email with quote. Patent EP 3,456,789 B1. Board votes with counts. Missing: whistleblower allegation, EY going concern flag.
ChatGPT
Concealment quote (slightly different wording). Deposition reference confirming Livent exit. NexGen financials. But no patent numbers, no board votes, no Schwartz warning, no EY flag, no insolvency timeline.
06 · Regulatory Coverage
GN
10
CW
9
GPT
1
GenieAI
Type Approval: 4 non-conformances, 5 metrics (all FAIL), 3-scenario impact, 3 deadlines. Battery Reg: carbon per supply chain stage, recycled content, Article 77 Digital Passport, EPR targets. NordischEM linkage explicit.
CoWork
Type Approval with 4 findings. Battery Reg with 6 phases through 2031. Carbon footprint. Recycled content. Articles 48, 20.
ChatGPT
Mentions EU Type Approval ONLY for autonomous driving (UN R157). No NordischEM conformity failures, no KBA, no brake pads, no EUR 6.3M/day halt. No Battery Regulation 2023/1542, no carbon footprint, no battery passport. Single largest gap.
07 · Financial Quantification
GN
10
CW
8
GPT
5
GenieAI
EUR 2.949B-4.312B aggregate with breakdown. NexGen: 800K-1.2M cells, 40K-60K packs, EUR 1.8-2.7B revenue. QuantumFlux EUR 585M with premium analysis. Type Approval 3-scenario. Insurance limits. Monthly burn with insolvency timeline.
CoWork
EUR 1.8-2.7B aggregate. Year-by-year NexGen. AutonomX royalties. Recall costs. Exposure as % of operating income (18-28%).
ChatGPT
Lithium corridor EUR 150M/year (novel). Berlin disruption model. BUT multiple speculative figures (EUR 4.7B Berlin, EUR 525M FSD, EUR 675M NordischEM) on wrong bases. No aggregate min/max from actuals. Mixes sourced and hypothetical.
08 · Cross-Reference
GN
10
CW
7
GPT
3
GenieAI
10 systemic insights. NordischEM→Type Approval. LRI→carbon→Battery Reg. AutonomX+NexGen compounding. Tesla knowledge gap. 12× escalation. Board deviations as governance pattern.
CoWork
Dual-front challenge. NexGen→Battery Reg. NE-EV1→quality. Margin context. Good but less systematic than GN.
ChatGPT
NexGen dependency vs. 4680 ramp noted. Lithium→cost→margin chain. But no cross-partnership connections. No systemic patterns. No governance analysis. Risks in isolated silos.
09 · Counterparty Risk
GN
9
CW
7
GPT
3
GenieAI
NexGen: current ratio 0.67, D/EBITDA 3.2x, cash EUR 38.2M. QuantumFlux tech readiness (500/1,500 cycles). NordischEM distraction.
ChatGPT
NexGen: revenue EUR 285.4M, pilot EUR 28.5M, bank debt EUR 185M. Good qualitative "financial fragility" insight. But no ratios, no cash, no burn, no insolvency timeline, no going concern, no covenant analysis.
10 · Clause Analysis
GN
7
CW
8
GPT
3
GenieAI
Section 5.4 concealment. First Amendment 4.3 (180-day notice). JDA IP notification. NDA auto-assignment with EUR 5M cap. Trades clause depth for commercial context.
CoWork
Broadest coverage: MSA §5.4, §13.3(c); JDA §6.3(b), §6.7; MLA §7.5; NDA §6.5, §9.2(b); QSM §7.3.3; EU Reg Art 20. Identifies Tesla's own JDA failures.
ChatGPT
Mentions "Qualified Alternative Supplier clause" and "Most Favored Licensee" but zero specific section numbers. No FM exclusion clause. No supplier change consent by section. No JDA IP provisions.
11 · Actionability
GN
7
CW
8
GPT
5
GenieAI
Risk ratings with timelines. Supply contingency with 4 alternatives. AutonomX 4-step escalation dates. But embedded conclusions, not structured playbook.
ChatGPT
6 mitigation priorities: activate 180-day clause, maintain CATL fallback, renegotiate multi-source. Practical but no timeline structure, no named counsel, no regulatory actions. Acknowledges output is not board-ready.
12 · Key Points
GN
10
CW
9
GPT
2
GenieAI
8/8 with superior precision. Board Sep 15 2022, 6-0 vote. NordischEM 100K-175K range. AutonomX EUR 250M+ decomposed. Master Plan Part 3 with March 2023 date.
CoWork
8/8. All expected key points including NordischEM 100K and Master Plan Part 3.
ChatGPT
~2/8. Board partnerships (partial, no votes/dates). NexGen dependency. MISSING: EUR 2.5B commitment, AutonomX EUR 250M+, NordischEM 100K, QuantumFlux, Type Approval EUR 189-567M, Master Plan Part 3.
13 · Dispute Posture
GN
8
CW
8
GPT
2
GenieAI
NexGen 70-80% probability. AutonomX Legal vs. Engineering split. NordischEM termination + cure. QuantumFlux leverage. Tesla governance exposure.
ChatGPT
NexGen FM framed as binary with no probability. Concealment not leveraged as legal advantage. AutonomX: lists outcomes but no strength assessment. NordischEM: no dispute posture. No Tesla weakness analysis.
14 · Timeline Tracking
GN
9
CW
8
GPT
3
GenieAI
Type Approval: Nov 30, Dec 15, Jan 15. AutonomX 4-step escalation. Battery Reg 6 deadlines through 2031. NordischEM 90-day cure. 180-day notice. NexGen insolvency 4-6 months.
CoWork
Nov 30, Jan 15. Battery Reg through 2031. 15-day AutonomX window. 90-day NordischEM cure. Q2 2024 QuantumFlux.
ChatGPT
Berlin targets: 5,000/week Q4 2022, 10,000/week Q2 2023. "180-day clock." "3 years" for terms. No regulatory deadlines, no KBA dates, no Battery Reg dates, no escalation calendar, no cure periods.
15 · Legal Precision
GN
9
CW
8
GPT
4
GenieAI
"Attorney-Client Privileged / Work Product." CRITICAL/HIGH/MEDIUM-HIGH scale. "Facially consistent." "Severing the causal nexus." "Unquantified risks" scope exclusion.
ChatGPT
Speculative calculations as quasi-definitive ("~EUR 4.7B"). Hypothetical assumptions without qualification. Wrong ASP base. Casual: "If useful, I can now convert this into." Tilde approximations. No formal classification until end.
ChatGPT - Critical Gaps
Six of the largest scoring deficits vs GenieAI reveal fundamental coverage failures
-9
Regulatory Coverage
GN: 10 · GPT: 1
Zero Type Approval crisis. Zero EU Battery Regulation.
-8
Key Points Coverage
GN: 10 · GPT: 2
Only 2 of 8 expected points addressed
-7
Cross-Reference
GN: 10 · GPT: 3
Risks treated as isolated silos
-6
Counterparty Risk
GN: 9 · GPT: 3
No financial ratios, no insolvency timeline
-6
Dispute Posture
GN: 8 · GPT: 2
Binary FM framing, no probability assessment
-5
Financial Quantification
GN: 10 · GPT: 5
Speculative extrapolations on wrong base figures
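The deficit cards above are simple per-metric subtractions, and they can be reproduced from the section scores reported in this benchmark. In the sketch below, the metric-04 (Risk Coverage) scores are inferred from the published totals (135 and 56); every other value is taken directly from the metric sections above.

```python
# Reproduces the GenieAI-vs-ChatGPT deficit ranking from the per-metric
# scores reported in this benchmark. Metric 04 (Risk Coverage) scores are
# inferred from the published totals (135 and 56); the rest are taken
# directly from the section scores above.

GN = {
    "Factual Accuracy": 10, "Source Attribution": 9, "Legal Reasoning": 8,
    "Risk Coverage": 10, "Evidentiary Quality": 9, "Regulatory Coverage": 10,
    "Financial Quantification": 10, "Cross-Reference": 10,
    "Counterparty Risk": 9, "Clause Analysis": 7, "Actionability": 7,
    "Key Points": 10, "Dispute Posture": 8, "Timeline Tracking": 9,
    "Legal Precision": 9,
}
GPT = {
    "Factual Accuracy": 6, "Source Attribution": 5, "Legal Reasoning": 4,
    "Risk Coverage": 5, "Evidentiary Quality": 5, "Regulatory Coverage": 1,
    "Financial Quantification": 5, "Cross-Reference": 3,
    "Counterparty Risk": 3, "Clause Analysis": 3, "Actionability": 5,
    "Key Points": 2, "Dispute Posture": 2, "Timeline Tracking": 3,
    "Legal Precision": 4,
}

# Sort metrics by GenieAI-minus-ChatGPT deficit, largest first.
deficits = sorted(((GN[m] - GPT[m], m) for m in GN), reverse=True)
```

The top of this ranking is Regulatory Coverage at 9 points (GN 10 vs GPT 1), matching the card above, followed by Key Points (8) and Cross-Reference (7).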
Where GenieAI Leads over CoWork
Advantages driven by RAG-based deep document mining
+3
Cross-Reference
GN: 10 · CW: 7
+2
Factual Accuracy
GN: 10 · CW: 8
+2
Risk Coverage
GN: 10 · CW: 8
+2
Financial Quant.
GN: 10 · CW: 8
+2
Evidentiary Quality
GN: 9 · CW: 7
+2
Counterparty Risk
GN: 9 · CW: 7
Where CoWork Leads over GenieAI
Structural and clause-level depth advantages
+1
Clause Analysis
CW: 8 · GN: 7
+1
Actionability
CW: 8 · GN: 7
What ChatGPT Does Differently
Financial modeling extrapolations - consulting-style what-if scenarios, not legal analysis
Lithium Corridor
EUR 150M/year price volatility exposure
Novel angle, not in other responses
Berlin Disruption
20% disruption model → EUR 4.7B impact
Built on incorrect EUR 45K ASP
FSD Monetization
EUR 525M/year at EUR 7K × 15% penetration
Entirely hypothetical, no source
Margin Erosion
5% margin erosion at scale → EUR 1B+
Assumption-based extrapolation
System Profiles
GenieAI
A step-change in legal AI. Covers all 8 key points, 5 partnerships (incl. Panasonic historical), both regulatory workstreams, all 4 board meetings. 10-point cross-cutting risk analysis identifies systemic patterns - 12× concentration escalation, board authorization deviations, Tesla's knowledge gap - that no other system surfaced. Seven perfect 10/10 scores.
A+ · Litigation-grade + Board-ready
CoWork
Competent legal risk assessment with the broadest clause-level analysis across all 4 contracts (MSA, JDA, MLA, NDA, QSM, EU Reg). Three-tier action plan with named suppliers, acquisition strategies, and dual-signature protocol. Honest about Tesla's own procedural failings. Gap: document mining depth - whistleblower evidence, insolvency trajectory, cascading chains.
B+ · Action-oriented + Structured
ChatGPT
Operates as financial consulting, not legal analysis. Introduces novel what-if scenarios (lithium corridor, FSD monetization) but on incorrect base figures (EUR 45K ASP vs actual EUR 28.5K-39.5K). Misses QuantumFlux entirely, has zero regulatory coverage, covers only 2/8 key points, and presents binary dispute framing with no probability assessment.
F · Financial modeling only
Bottom Line
The three-way comparison reveals a clear tier structure. GenieAI (A+, 90%) leads in 11 of 15 metrics through RAG-powered document access delivering both breadth and depth. CoWork (B+, 79.3%) produces a competent legal risk assessment with the strongest clause-level analysis and most structured recommendations.
ChatGPT (F, 37.3%) fails the benchmark fundamentally - missing QuantumFlux entirely, zero regulatory compliance coverage, only 2 of 8 expected key points, and speculative extrapolations built on incorrect base figures presented as quasi-authoritative projections. Its strength - financial what-if modeling - is a different discipline than what the question asked for.
The 79-point gap between GenieAI and ChatGPT, and the 63-point gap between CoWork and ChatGPT, demonstrate that access to source documents is not merely helpful but dispositive for legal-quality work product.